AMENDMENTS TO THE SPECIFICATION 



Please replace the Abstract with the following amended Abstract: 

[0122] A multiprocessor switching device substantially implemented on a single CMOS 
integrated circuit is described in connection with a packet data transfer circuit that uses a 
fragment storage buffer to align and/or merge data being transferred to or from memory on a 
plurality of channels. In a packet reception embodiment, a data shifter and fragment store buffer 
are used to align received packet data to any required offset. The aligned data may then be
written to the system bus or combined with data fragments from prior data cycles before being 
written to the system bus. When packet data is being transferred to memory on a plurality of 
channels, the fragment storage may be channelized using register files or flip-flops to store 
intermediate values of packets and states for each channel. 

Please replace paragraph [023] beginning on page 6 with the following amended 
paragraph: 

[023] As shown in Figure 1, the four processors 102, 106, 110, 114 are joined to the 
internal bus 130. When implemented as standard MIPS64 cores, the processors 102, 106, 110,
114 have floating-point support, and are independent, allowing applications to be migrated from
one processor to another if necessary. The processors 102, 106, 110, 114 may be designed to
any instruction set architecture, and may execute programs written to that instruction set 
architecture. Exemplary instruction set architectures may include the MIPS instruction set 
architecture (including the MIPS-3D and MIPS MDMX application specific extensions), the IA- 
32 or IA-64 instruction set architectures developed by Intel Corp., the PowerPC instruction set 
architecture, the Alpha instruction set architecture, the ARM instruction set architecture, or any 
other instruction set architecture. The system 100 may include any number of processors (e.g., 
as few as one processor, two processors, four processors, etc.). In addition, each processing unit 
102, 106, 110, 114 may include a memory sub-system (level 1 cache) of an instruction cache and 
a data cache and may support separately, or in combination, one or more processing functions. 
With respect to the processing system example of Figure 2, each processing unit [[102, 106, 110,
114]] 260, 261, 262, 263 may be a destination within multiprocessor device 100 and/or each
processing function executed by the processing modules [[102, 106, 110, 114]] 260, 261, 262, 263
may be a source within the processor device 100.

Please replace paragraph [024] on page 7 with the following amended paragraph:

[024] The internal bus 130 may be any form of communication medium between the 
devices coupled to the bus. For example, the bus 130 may include shared buses, crossbar 
connections, point-to-point connections in a ring, star, or any other topology, meshes, cubes, etc. 
In selected embodiments, the internal bus 130 may be a split transaction bus (i.e., having 
separate address and data phases). The data phases of various transactions on the bus may 
proceed out of order with the address phases. The bus may also support coherency and thus may 
include a response phase to transmit coherency response information. The bus may employ a 
distributed arbitration scheme, and may be pipelined. The bus may employ any suitable 
signaling technique. For example, differential signaling may be used for high speed signal 
transmission. Other embodiments may employ any other signaling technique (e.g., TTL, CMOS, 
GTL, HSTL, etc.). Other embodiments may employ non-split transaction buses arbitrated with a 
single arbitration for address and data and/or a split transaction bus in which the data bus is not 
explicitly arbitrated. Either a central arbitration scheme or a distributed arbitration scheme may 
be used, according to design choice. Furthermore, the bus may not be pipelined, if desired. In 
addition, the internal bus 130 may be a high-speed (e.g., 128-Gbit/s) 256-bit cache line wide split
transaction cache coherent multiprocessor bus that couples the processing units 102, 106, 110, 
114, cache memory 118, memory controller 122 (illustrated for architecture purposes as being 
connected through cache memory 118), node controller 134 and packet manager 148 together.
The bus 130 may run in big-endian and little-endian modes, and may implement the standard 
MESI protocol to ensure coherency between the four CPUs, their level 1 caches, and the shared 
level 2 cache 118. In addition, the bus 130 may be implemented to support all on-chip 
peripherals (e.g., 265 in Figure 2), including a PCI/PCI-X interface 126 and the input/output
bridge interface 156 for the generic bus, SMBus, UARTs, GPIO and Ethernet MAC.

Please replace paragraph [025] beginning on page 7 with the following amended 
paragraph: 

[025] The cache memory 118 258 may function as an L2 cache for the processing units
102, 106, 110, 114, node controller 134 and/or packet manager 148. With respect to the
processing system example of Figure 2, the cache memory [[118]] 258 may be a destination within
multiprocessor device [[100]] 215.

Please replace paragraph [032] on page 11 with the following amended paragraph:

[032] The system controller 152 (or 346 in Figure 3) is coupled to provide interrupts to
the interrupt lines in processors 102, 106, 110, 114 and is further coupled to receive interrupt 
requests from system modules (such as packet manager 152 or packet-based interfaces 162, 166, 
170 illustrated in Fig. 1) and from other devices within the system 100 (not shown). In an 
alternative embodiment described herein, the interrupt mapping function may instead or in 
addition be provided in the various system modules that generate interrupts, such as with an 
interrupt mapper 370 in the packet manager 442 320 or with an address map 339 in the packet-
based interfaces [[162, 166, 170]] 330, 331, 332 illustrated in Fig. [[1]] 3. The system controller 152 or
346 may map each interrupt to one of the interrupt lines of processors 102, 106, 110, 114, and 
may assert an interrupt signal to the selected processor 102, 106, 110, 114. The processors 102, 
106, 110, 114 may access the system controller 152 to determine the source of a given interrupt. 
The system controller 152 or 346 may employ any mapping mechanism. In one embodiment,
the system controller 152 may comprise a channel register and a source register to map each
interrupt request to each processor 102, 106, 110, 114. The channel register identifies to the 
processor which channels are generating interrupts, and the source register indicates the real 
source of a channel’s interrupt. By using a programmable interrupt controller in the packet 
manager 320 with interrupt channel and source information stored in configuration status 
registers, the interrupt mapper 370 can mask events and vector interrupts to their final destination 
using at most two CSR read operations by the processor, although additional mapping 380 can be 
done in the system controller 152 or 346.
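
By way of illustration only, the two-read lookup described in this paragraph may be sketched in software as follows. The sketch is not part of the specification: the class and function names, the number of channels, the per-channel mask registers, and the choice of servicing the lowest-numbered channel first are all assumptions made for the example.

```python
# Hypothetical software model of the channel/source register lookup.
# All names, register widths, and the priority rule are assumptions.

class InterruptMapper:
    def __init__(self, num_channels=8):
        # One source register per channel: a bitmask of pending events.
        self.source = [0] * num_channels
        # One mask register per channel: a set bit hides that event.
        self.mask = [0] * num_channels

    def raise_event(self, channel, event_bit):
        """A module records a pending event on a channel."""
        self.source[channel] |= 1 << event_bit

    def read_channel_register(self):
        """CSR read: one bit per channel with an unmasked pending event."""
        value = 0
        for channel, pending in enumerate(self.source):
            if pending & ~self.mask[channel]:
                value |= 1 << channel
        return value

    def read_source_register(self, channel):
        """CSR read: the unmasked pending events of one channel."""
        return self.source[channel] & ~self.mask[channel]


def find_interrupt_source(mapper):
    """Return (channel, source_bits, csr_reads) using at most two reads."""
    channels = mapper.read_channel_register()        # first CSR read
    if channels == 0:
        return (None, 0, 1)
    # Assumed priority rule: service the lowest-numbered channel first.
    channel = (channels & -channels).bit_length() - 1
    sources = mapper.read_source_register(channel)   # second CSR read
    return (channel, sources, 2)
```

In this model a processor that takes an interrupt learns both the channel and the event that caused it after two register reads, which is the bound stated above.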

Please replace paragraph [035] on page 13 with the following amended paragraph: 

[035] The tight coupling may be manifest in several fashions. For example, the 
interrupts may be tightly coupled. An I/O device (e.g., the packet interface circuits 162, 166,
170) may request an interrupt which is mapped (via an interrupt map 370 or 380 in the packet
manager 320 or system controller 346 shown in Figure 3) to one of the processors 102, 106, 110,
114. The transmission of the interrupt to the processor may be rapid since the signals may be
transmitted at the clock frequency of the integrated circuit comprising the system 100 (as
opposed to interconnecting separate integrated circuits). When the processor (e.g., 102) executes
the interrupt service routine, typically one or more status registers in the system controller 152
and/or the interrupting device are read. These status register reads may occur with relatively low
latency across the bus 130 (as opposed to, for example, a high latency peripheral bus such as 
PCI). The latency of the status register reads may, in some embodiments, be one or more orders 
of magnitude less than that of a peripheral bus such as PCI. 

Please replace paragraph [036] on page 13 with the following amended paragraph:

[036] As will be understood, the multiprocessor device 100 of the present invention 
provides multiprocessing functionality on its own which makes it suitable for scientific and 
embedded applications requiring significant computational capabilities. In a selected 
embodiment, the multiprocessor device (e.g., 100, 215) of the present invention contains a
number of peripherals along with its sophisticated memory and communication support. For
example, in a selected embodiment, the processor cores (e.g., 102 or 260) are 0.8 to 1.2-GHz, 64-
bit MIPS with 64 kbytes of level one cache memory per processor and 1 Mbyte of level two
cache 118 or 258 per chip; an 800-MHz DDR controller 122 or 264; off-chip ccNUMA support
and optional ECC support. Three 8/16-bit receive/transmit ports 162, 166, 170 (or 250, 252,
254) are also provided that are configurable as either HyperTransport or SPI-4 links. Additional
peripheral features 265 include a 32-bit 33/66-MHz PCI interface or 64-bit 133-MHz PCI-X
interface 126; an input/output bridge 156 that includes a 10/100/1000 Ethernet MAC interface,
general-purpose I/O ports, SMBus serial interfaces and four DUARTs.

Please replace paragraph [040] beginning on page 14 with the following amended 
paragraph: 

[040] Figure 2 depicts an example multiprocessor switch application 200 of the present 
invention showing how the HyperTransport/SPI-4 link architecture can be used in 
communication and multichip multiprocessing support. As illustrated, each link (e.g., 250, 252, 
254) can be configured as an 8- or 16-bit HyperTransport connection, or as a streaming SPI-4 
interface. In addition, each link includes hardware hash and route acceleration functions, 
whereby routing information for an incoming packet is calculated. The routing information
determines how a packet will steer through the internal switch (e.g., 256) of a multiprocessor
device (e.g., 205, 210, 215, 220, 225, 230). The destination through the switch can be either an
output port or the packet manager input. Generally speaking, the steering is accomplished by
translating header information from a packet (along with other input data) to an output virtual
channel (OVC). In addition, the HyperTransport links (e.g., 250, 252, 254) work with a mix of
HyperTransport transactions, including encapsulated SPI-4 packets and nonlocal NUMA 
memory access. 

Please replace paragraph [044] on page 16 with the following amended paragraph: 

[044] As depicted, the network and system chip 300 includes an on-chip five-port 
switch 310 that connects a node controller (shown in Figure 1 as node controller 134) and packet 
manager 320 to three high-speed transmit/receiver circuits 330-332, 350-352. Software resident 
in the memory 340 and processors 342, 343, 344, 345 may process and modify incoming
packets, may require direct storage in memory 340 without modification, and may generate 
packets for transmission via transmitter circuits 350-352. The node controller manages 
HyperTransport (HT) transactions and remote memory accesses for the cache coherent, 
distributed-shared-memory model of the system. The packet manager 320 provides hardware 
assisted packet processing capabilities, such as DMA engines, channel support, multiple 
input/output queues, TCP/IP checksum functions, and output scheduling. The high-speed 
receiver and transmitter circuits can operate in one of two modes: HT or SPI-4 Phase 2. The 16-
bit HT mode allows connection to companion multiprocessor devices in a daisy-chain
configuration, to HyperTransport bridge chips (e.g., 202, 204) for additional I/O devices (e.g., 
PCI-X peripherals 201, InfiniBand Fabric chips 203), or to an external switch for scalable 
bandwidth applications. The SPI-4 mode is intended for direct connection to physical layer 
network devices - e.g., 10 GE MAC, OC-192 SONET framer, or to an application specific 
(ASIC) chip that provides customer enabled network functions. 

Please replace paragraph [074] on page 22 with the following amended paragraph: 

[074] The hash and route (H&R) block 335 makes all of the routing decisions for 
ingress packets from the high-speed receiver ports 330-332 by calculating, for each packet, an 
output virtual channel (OVC) which is used for internal switching on the multiprocessor device 
300. The packets are then sent to either the packet manager input (PMI) 322 or to one of the 
transmit ports 350-352. The H&R module 335 is located in each of the three high-speed receiver 
ports 330-332. As a packet 301 enters the receiver port (e.g., 330), it is decoded and control 
information is extracted by the receiver interface or decoder 333. The H&R module 335 
calculates the routing result by using this control information along with the packet data and
several programmable tables in the H&R module 335. Routing information is encoded in the 
form of a switch or output virtual channel (OVC) which is used by the on-chip switch 310 to 
route packets. The OVC describes the destination module, such as the PMI 322 or transmitter 
ports 350-352, and either the input queue number (IQ) in the case of the PMI or the output 
channel in the case of the transmitter ports. When targeting the packet manager 320, the output 
virtual channel corresponds directly to IQs. On the output side, the packet manager 320 maps an 
OQ into one OVC which always corresponds to a transmitter port. In addition, multiple sources 
can send packets to a single destination through the switch. If packets from different sources 
(receivers 330, 331, 332 or PMO 324) are targeted at the same output VC of a transmitter port or 
the IQ of the PMI 322, the switch 310 will not interleave chunks of packets of different sources 
in the same VC. Both the packet data and its associated route result are stored in the receiver 
buffer 338 before the packet is switched to its destination. The H&R module 335 can be 
implemented by the structures disclosed in copending U.S. patent application entitled “Hash and
Route Hardware With Parallel Routing Scheme” by L. Moll, Ser. No. [[ ]] 10/684,871,
filed [[ ]] 10/14/03, and assigned to Broadcom Corporation, which is also the
assignee of the present application, and is hereby incorporated by reference in its entirety.

Please replace paragraph [085] on page 26 with the following amended paragraph: 

[085] The input queues of the PMI 322 and the output queues of the PMO 324 may be 
logical queues. That is, the queues may actually be implemented in system memory. The PMI 
322 and the PMO 324 may include buffers to buffer the packet data being transmitted to and 
from the system memory. The queues may be implemented in any fashion. In one particular 
embodiment, each queue is implemented as a descriptor ring (or chain) which identifies memory 
buffers to store packet data corresponding to a given input queue. Additional details concerning 
the use of descriptors to control packet memory transfer operations are disclosed in copending 
U.S. patent applications entitled “Descriptor Write Back Delay Mechanism to Improve
Performance” by K. Oner, Ser. No. [[ ,]] 10/685,137, filed October 14, 2003,
“Exponential Channelized Timer” by K. Oner, Ser. No. [[ ]] 10/684,916, filed October
14, 2003, and “Descriptor-Based Load Balancing” by K. Oner and J. Dion, Ser. No.
[[ ]] 10/684,614, filed October 14, 2003, now U.S. Patent No. 6,981,074, issued
December 27, 2005, each of which was filed on October 14, 2003, and is assigned to Broadcom
Corporation, which is also the assignee of the present application, and each of which is hereby
incorporated by reference in its entirety. In other embodiments, the queues may be implemented 
in any desired fashion (e.g., linked lists, contiguous memory locations for the packet memory 
buffers, etc.). The PMI 322 and the PMO 324 may generate read and write commands to fetch 
and update descriptors. 

Please replace paragraph [091] beginning on page 28 with the following amended 
paragraph: 

[091] Generally, the control circuit 584 may generate read commands to the 
interconnect interface circuit 580 to prefetch descriptors into the descriptor buffer 586. 
Additionally, the control circuit 584 may generate write commands to the interconnect interface 
circuit 580 to write data from the input buffer 588 to the memory buffer, and to write the 
descriptor back to memory after the descriptor has been used to store packet data. The 
interconnect interface circuit 580 may transmit the commands on the bus 130 and, in the case of 
reads, return data to the descriptor buffer 586. In one embodiment, the bus 130 may perform 
cache block sized transfers (where a cache block is the size of a cache line in caches within the 
system 100, e.g. 32 bytes in one embodiment). In such embodiments, if a write command does 
not write the entire cache block, the interconnect interface circuit 580 may perform a read- 
modify-write operation to perform the write. As will be appreciated, a read-modify-write 
operation requires a delay while the cache line being written to is retrieved or read from memory 
over the system bus 130 so that it can be merged with (or written over in part by) the new data
for the cache line. In one embodiment, descriptors may occupy one half of a cache block. In
such embodiments, the packet manager circuit 516 may attempt to delay the write back of the
first descriptor of a cache block to allow the second descriptor to also be written together (thus 
avoiding a higher latency read-modify-write operation). The delay may be fixed or 
programmable, and the first descriptor may be written using a read-modify-write operation if the 
delay expires without a write of the second descriptor. The second descriptor may subsequently 
be written using a read-modify-write operation as well. Because the system cannot wait
indefinitely for additional descriptors to be released, a programmable timer 530 (or 375 in Figure 
3) is provided for controlling the delay. In selected embodiments, multiple timers 375 may be 
provided, such as a timer for descriptor write back operations and a timer for the interrupt
operations. This can be replicated in both the PMI 540 and the PMO 542. 
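
By way of illustration only, the delayed descriptor write back described in this paragraph may be modeled as follows. The sketch assumes two descriptors per cache block, as in the embodiment above. The class name, the operation labels ("full_write", "rmw"), the use of an abstract time value, and the decision to flush a pending descriptor when a descriptor from a different cache block is released are assumptions of the example and are not taken from the specification.

```python
# Hypothetical model of the descriptor write-back delay.
# Descriptor index n lives in cache block n // 2, half n % 2.

class WriteBackCoalescer:
    def __init__(self, delay):
        self.delay = delay      # programmable delay, in abstract time units
        self.pending = None     # (block, half, deadline) of a held descriptor

    def tick(self, now):
        """Expire the timer: a held descriptor goes out as read-modify-write."""
        ops = []
        if self.pending is not None and now >= self.pending[2]:
            block, half, _ = self.pending
            ops.append(("rmw", block, half))
            self.pending = None
        return ops

    def release(self, index, now):
        """Software or hardware releases descriptor `index` at time `now`."""
        ops = self.tick(now)
        block, half = divmod(index, 2)
        if self.pending is not None and self.pending[0] == block:
            # Both halves are available: one full cache block write.
            ops.append(("full_write", block))
            self.pending = None
            return ops
        if self.pending is not None:
            # Assumed policy: a held descriptor of another block is flushed.
            ops.append(("rmw", self.pending[0], self.pending[1]))
            self.pending = None
        if half == 0:
            # First descriptor of a block: hold it and start the timer.
            self.pending = (block, 0, now + self.delay)
        else:
            # Second descriptor with no partner held: write it by itself.
            ops.append(("rmw", block, 1))
        return ops
```

For example, with a delay of 10, releasing descriptors 0 and 1 at times 0 and 5 yields a single full write of block 0, whereas releasing descriptor 2 at time 20 and descriptor 3 at time 40 yields two read-modify-write operations on block 1.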

Please replace paragraph [098] on page 31 with the following amended paragraph: 

[098] Turning now to Figure 6, the transfer of two packets (Packet1 and Packet2) to and
from system memory using single and multiple descriptors is illustrated for both the PMI 322 
and PMO 324. As depicted, descriptors 601-604 represent an output queue 610-613 ready for 
transmission, as indicated by the hardware bits (HW) being set to “1.” Descriptors 651-654 
represent an input queue 660-663 that the packet manager 320 has just written to memory (e.g., 
memory 340 or cache memory 118), as indicated by the hardware bits (HW) being set to “0.”
For both input and output packets, the first packet (e.g., first output packet 605) is small enough
to fit in a single descriptor (e.g., 601). With such packets, the output descriptor (Descr1 601) has
the EOP and the SOP bits set. Likewise, the input descriptor (e.g., Descr1 651) has both its SOP
and EOP bits set. In the input queue, the length field (Len1) of the first descriptor (Descr1 651)
is updated with the correct packet length (Len1') after the packet is received by packet manager
320.



Please replace paragraph [0100] on page 32 with the following amended paragraph: 

[0100] In connection with the present invention, it is also significant to note that the long 
packet 656 (Packet2) is well over 32B in length, which requires that multiple 16B chunks of data 
received from the switch 140 be combined or merged as part of buffer storage 660-663 through 
the PMI. 



Please replace paragraph [0107] on page 34 with the following amended paragraph: 

[0107] Figure 8 illustrates a selected embodiment where the input FIFO buffer 71 for a 
channel is implemented with a read pointer 81 and a write pointer 82, each of which has a write
address 81a, 82a and a read address 81b, 82b. These pointers are used to point to the entry at the
head and at the tail of the FIFO 84, respectively, for a given channel. When there are multiple
channels 86, pointers for each channel’s buffer may be incremented 80, 83 and stored in a one
read register file 87 or a write register file. 
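
By way of illustration only, a channelized FIFO whose head and tail pointers are kept in per-channel register files may be modeled as follows. The class name, the per-channel depth, and the use of pointers that count modulo twice the depth (so that a full FIFO can be told apart from an empty one) are assumptions of the example.

```python
# Hypothetical model of a per-channel FIFO with pointer register files.

class ChannelizedFifo:
    def __init__(self, num_channels, depth):
        self.depth = depth
        # Storage for every channel's FIFO entries.
        self.mem = [[None] * depth for _ in range(num_channels)]
        # Register files: one read pointer and one write pointer per channel.
        self.read_ptr = [0] * num_channels
        self.write_ptr = [0] * num_channels

    def count(self, channel):
        """Number of entries currently held for the channel."""
        span = 2 * self.depth
        return (self.write_ptr[channel] - self.read_ptr[channel]) % span

    def push(self, channel, item):
        """Write at the tail, then increment and store the write pointer."""
        if self.count(channel) == self.depth:
            raise OverflowError("channel FIFO is full")
        self.mem[channel][self.write_ptr[channel] % self.depth] = item
        self.write_ptr[channel] = (self.write_ptr[channel] + 1) % (2 * self.depth)

    def pop(self, channel):
        """Read from the head, then increment and store the read pointer."""
        if self.count(channel) == 0:
            raise IndexError("channel FIFO is empty")
        item = self.mem[channel][self.read_ptr[channel] % self.depth]
        self.read_ptr[channel] = (self.read_ptr[channel] + 1) % (2 * self.depth)
        return item
```

Each channel keeps its own ordering, so entries pushed on one channel never disturb the head or tail of another.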

Please replace paragraph [0116] beginning on page 37 with the following amended
paragraph: 

[0116] Turning now to Figure 9, the alignment data path for the packet manager output
circuit 542 is depicted, whereby data from the memory 340 is transmitted via system bus 130 and 
packet manager to the switch 140. As illustrated, the PMO align/merge/split circuit 90 does the 
reverse of the data alignment function of the PMI alignment and merge circuit 70. That is, data 
fetched from memory 340 arrives from the bus 130 having a memory line width of 32B. Since 
the switch interface 582 is 16B wide, data is converted to 16B chunks by data buffers 91, 92 and 
MUX 93 in the interconnect interface 580. As with the operation of the PMI 540, operation of 
the PMO 542 is controlled by descriptors that specify where the data that is to be transferred is 
located in the memory 340. The descriptors may specify further that each retrieved buffer can 
have any offset within a cache or memory line from memory. In addition, when buffers 94 
retrieved from memory 340 are smaller than 16B in size, it is possible that after fetching a line 
from memory 340, there may not be enough data to send out. In this case, the PMO align/split 
circuit 90 may write partial results into the output buffer 99, or alternatively may accumulate 
16B of data before writing them into the output buffer 99. The output buffer 99 may be
implemented in a similar fashion as the input buffer 71, as depicted in Figure 8. 

Please replace paragraph [0117] on page 38 with the following amended paragraph: 

[0117] To implement the data accumulate function, the PMO align/split circuit 90 first 
fetches a memory line of 32B from the bus 130 and stores the line in the buffers 91, 92 in the 
interconnect interface 580. By providing a 16B bus between the interconnect interface 580 and 
the PMO 542, wire congestion is reduced. Since the width of the switch 140 is also 16B, the 
performance of the PMO 542 is not affected. As depicted in Figure 9, the interconnect interface 
580 splits the retrieved memory line into two 16B chunks and sends them separately to the PMO 
542 through MUX 93. If data alignment is required to in effect remove an offset from the buffer
storage 94, the PMO align/split circuit 90 shifts the data to the right using a barrel or rotate shifter
95. The generated data (point A) is merged 97 and stored in the Fragment Store Buffer (FSB)
98, unless there is enough data to write to the output buffer 99. In the next data beat, the new
data from the interconnect interface 580 will be shifted and merged through MUX 96 with the 
data from the FSB 98 (point B) to possibly create a full 16B of data (point C). 

Please replace paragraph [0118] on page 38 with the following amended paragraph:

[0118] Where very small buffers are described by the output descriptors, it is possible 
that, even after two data cycles, there are less than 16B of data available for transmission and the
end of the packet is not reached. In this case, the generated data (point C) is merged with
generated data (point A) at MUX 97 and rewritten into the FSB register 98 and is re-used in the
next data cycle. This recirculation of data can continue until either the end of the packet is 
reached or 16B of data is accumulated to write to the output buffer 99. After all data for a packet 
is read from the memory 340, any remaining data left in FSB 98 will be flushed out as well. 
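
By way of illustration only, the output-side accumulate function may be modeled in software as follows. The sketch fetches 32B memory lines, splits each into two 16B beats, discards the bytes that lie outside the described buffer (the role played by the shifter), merges the remainder with the fragment store, and emits a chunk whenever 16B have accumulated. The function name, the representation of descriptors as (address, length) pairs, and the use of byte strings in place of registers are assumptions of the example.

```python
# Hypothetical software model of the PMO align/split data path.

LINE = 32   # memory line width in bytes
BEAT = 16   # switch interface width in bytes


def pmo_align(memory, descriptors):
    """Return the 16B chunks (last one may be shorter) for one packet.

    memory      -- bytes object standing in for system memory
    descriptors -- list of (address, length) buffers, in packet order
    """
    out = []
    fsb = b""                                   # fragment store buffer
    for addr, length in descriptors:
        end = addr + length
        line_start = addr - addr % LINE         # first line touching the buffer
        while line_start < end:
            line = memory[line_start:line_start + LINE]
            # The interconnect interface splits the line into two beats.
            for beat_start in (line_start, line_start + BEAT):
                lo = max(addr, beat_start)
                hi = min(end, beat_start + BEAT)
                if lo >= hi:
                    continue                    # beat holds no buffer data
                # Shift: keep only the bytes that belong to the buffer.
                valid = line[lo - line_start:hi - line_start]
                fsb += valid                    # merge with stored fragment
                if len(fsb) >= BEAT:
                    out.append(fsb[:BEAT])      # full 16B goes to output buffer
                    fsb = fsb[BEAT:]            # remainder recirculates
            line_start += LINE
    if fsb:
        out.append(fsb)                         # end of packet: flush the FSB
    return out
```

With small buffers, several beats may pass before a chunk is emitted; the fragment store simply keeps growing until 16B are available or the packet ends, at which point whatever remains is flushed.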